Abstract



This study compared the accuracies of four differential item functioning (DIF) estimation 
methods, where each method makes use of only one of the following: raw data, logistic 
regression, loglinear models, or kernel smoothing. The major focus was on the estimation 
strategies’ potential for estimating score-level, conditional DIF. A secondary focus was on 
assessing the accuracy of strategies’ overall DIF effect sizes and statistical significance tests. A 
real data simulation was used to evaluate the estimation strategies with 6 items representing DIF 
and No DIF situations, and with 4 sample size combinations for the reference and focal group 
data. Results showed that the logistic regression estimation strategy was the most highly 
recommended strategy in terms of the bias and variability of its estimates and the power of its 
statistical significance test. The loglinear models strategy had flexibility advantages, but these 
advantages only offset the greater variability of its estimates and its reduced statistical power 
when sample sizes were large. The kernel smoothing estimation strategy was the least accurate 
of the considered strategies due to estimation problems when the reference and focal groups 
differed in overall ability. 

Key words: DIF, kernel smoothing, loglinear models, logistic regression 
While the psychometric literature has defined differential item functioning (DIF) as a
performance difference between examinee groups at one level of ability (Dorans & Holland,
1993; Lord, 1980; Shepard, 1982), considerable research has focused on developing and
comparing DIF detection methods that summarize DIF across a total range of ability (Dorans & 
Kulick, 1986; Holland & Thayer, 1988; Kristjansson, Aylesworth, McDowell, & Zumbo, 2005; 
Roussos & Stout, 1996; Shealy & Stout, 1993; Swaminathan & Rogers, 1990; Zumbo, 1999; 
Zwick, Thayer, & Lewis, 2000). This work usually focuses on overall statistical significance 
tests of summary DIF indexes and, to a lesser extent, on the use of summary DIF indexes as 
overall effect sizes. Due to the potential of all summary measures to oversummarize in special 
circumstances (to be described below), effect sizes and significance tests of overall DIF may 
benefit by being supplemented with assessments of conditional, ability-level DIF. The purpose of 
this study was to compare the accuracies of four DIF estimation strategies for estimating 
conditional DIF (raw data, logistic regression, loglinear models, and kernel smoothing). 

Assessing Differential Item Functioning (DIF) 

The assessment of DIF is a determination of whether a studied item, Y, performs
differently for reference examinees, R, and focal examinees, F, conditioned on the M levels of a
variable that measures reference and focal examinees’ overall ability, $X_m$. In this study, Y is
dichotomously scored. $X_m$ denotes an observed test score that excludes Y and all items
containing extensive DIF, or C-DIF (Dorans & Holland, 1993).

The extent of item Y's DIF can be assessed by determining if the reference and focal
conditional expected scores differ for any of the M levels of $X_m$,

$$\text{Conditional DIF}_m = E(Y_{Fm}) - E(Y_{Rm}) \neq 0, \quad m = 1, \ldots, M. \qquad (1)$$

In typical DIF assessments, the M differences in (1) are summarized rather than individually
evaluated. One common DIF summary measure is a focal-weighted average of (1)'s conditional DIF
estimates,

$$\text{Standardized E-Dif} = \frac{\sum_m n_{Fm} \left[ E(Y_{Fm}) - E(Y_{Rm}) \right]}{\sum_m n_{Fm}}, \qquad (2)$$

where $n_{Fm}$ denotes the number of focal examinees at $X_m$. The DIF summary statistic in (2) is
referred to as a standardized expected score difference (i.e., standardized E-Dif; Dorans &
Schmitt, 1993). The standardized E-Dif is used mostly as an effect size measure of overall DIF, 
but since an estimate of its standard error is available (Dorans & Holland, 1993), it can also be 
used as a statistical significance test. 
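
As a minimal sketch of (1) and (2) under the raw data strategy, the following Python computes the conditional DIF at each score level and its focal-weighted average. The function names and the three-level toy values are hypothetical and chosen only for illustration; they are not taken from the study's data.

```python
def conditional_dif(focal_means, ref_means):
    """Equation (1): focal minus reference expected item score at each X_m."""
    return [f - r for f, r in zip(focal_means, ref_means)]


def standardized_e_dif(focal_means, ref_means, focal_counts):
    """Equation (2): average of the conditional DIF values, weighted by n_Fm."""
    total = sum(focal_counts)
    diffs = conditional_dif(focal_means, ref_means)
    return sum(n * d for n, d in zip(focal_counts, diffs)) / total


# Hypothetical three-level example (not the study's data).
focal_means = [0.20, 0.50, 0.90]   # E(Y_Fm)
ref_means = [0.40, 0.60, 0.90]     # E(Y_Rm)
focal_counts = [50, 30, 20]        # n_Fm

dif = conditional_dif(focal_means, ref_means)
e_dif = standardized_e_dif(focal_means, ref_means, focal_counts)
```

In this toy case the summary value is approximately -0.13 even though the conditional DIF at the highest score level is zero, which is the kind of score-level detail a single summary index cannot show.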

Potential difficulties with the standardized E-Dif measure are that it can downplay DIF in
easy and hard items (Dorans & Holland, 1993, p. 59) or in items exhibiting large degrees of
nonuniform DIF. In addition, the standardized E-Dif's most frequently described weighting
strategy, $n_{Fm} / \sum_m n_{Fm}$, may not be the most appropriate for particular purposes, such as evaluating
DIF in the proximity of potential cut scores. To address these issues, it can be useful to
supplement overall effect size and significance test DIF assessment by also assessing the M
differences in (1) with respect to magnitude and with respect to the M conditional standard
errors,



$$SE\left( E(Y_{Fm}) - E(Y_{Rm}) \right) = \sqrt{Var(E(Y_{Fm})) + Var(E(Y_{Rm}))}, \qquad (3)$$

where the $Var(E(Y_{Fm}))$ and $Var(E(Y_{Rm}))$ terms are the estimated variances of the expected item scores, $E(Y_{Fm})$ and $E(Y_{Rm})$. The
assessment of (1) and (3) using different DIF estimation strategies is the major focus of this
study.
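
The conditional standard error in (3) can be sketched for the raw data strategy as follows. The sketch assumes that the variance of each raw conditional mean is the simple binomial variance p(1 - p)/n; that choice, the function names, and the numeric values are illustrative assumptions rather than the Appendix A specification.

```python
import math


def binomial_mean_variance(p, n):
    """Variance of a raw conditional proportion p based on n examinees.

    Assumption for illustration: a simple binomial variance, p(1 - p) / n.
    """
    return p * (1.0 - p) / n


def conditional_se(p_focal, n_focal, p_ref, n_ref):
    """Equation (3): SE of the focal-minus-reference difference at one X_m."""
    return math.sqrt(binomial_mean_variance(p_focal, n_focal)
                     + binomial_mean_variance(p_ref, n_ref))


# Hypothetical values at a single score level.
dif_m = 0.30 - 0.45
se_m = conditional_se(0.30, 40, 0.45, 120)
band = (dif_m - 2 * se_m, dif_m + 2 * se_m)   # a +/- 2 SE band around the estimate
```

Because the two groups are independent samples, the variances add; the smaller focal group contributes most of the uncertainty in this example.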

Differential Item Functioning (DIF) Estimation Strategies 

This section summarizes the raw, logistic regression, loglinear models, and kernel 
smoothing DIF estimation strategies of interest in a general overview and as applied to a specific 
DIF example. Specific details are given in Appendixes A, B, C, and D for how each estimation 
strategy can be used in (1), (2), and (3) for estimating conditional DIF, conditional standard 
errors, and the standardized E-Dif measure, and for overall statistical significance tests. 

The use of raw data for estimating conditional means and variances in DIF (Appendix A) 
has been described for estimating the standardized E-Dif measure, for plots of conditional 
differences (Dorans & Holland, 1993; Dorans & Kulick, 1986), and also for overall statistical 
significance tests in the related simultaneous item bias test (SIBTEST) framework (Shealy &
Stout, 1993). Raw data offers the most direct approach to DIF estimation and has the least
potential for model misspecification error of the strategies considered in this study. The use of 
raw data produces conditional DIF estimates that are relatively unstable in terms of sampling 
variability, a feature that could make the estimates less useful than those based on other 
strategies. 

The application of logistic regression procedures to DIF assessment (French & Miller, 
1996; Jodoin & Gierl, 2001; Kristjansson et al., 2005; Swaminathan & Rogers, 1990) involves 
predicting the probability of a correct response on Y using logistic curves based on $X_m$,
membership in the reference or focal group, and the interaction of group membership and $X_m$
(Appendix B). Logistic regression has been studied as an overall significance test and has
received attention for its estimates of conditional DIF (French & Miller, 1996) and effect sizes 
(Jodoin & Gierl, 2001; Zumbo, 1999). As a significance test, logistic regression has been shown 
to be a powerful test relative to other strategies (Swaminathan & Rogers, 1990), especially for 
detecting levels of DIF that are not the same at each level of $X_m$ (i.e., nonuniform DIF). The
accuracy of logistic regression’s conditional DIF estimates is less clear, as its imposition of
logistic curves is the strongest assumption made on the data among the DIF strategies
considered in this study, perhaps increasing its potential for biased estimation (Hanson &
Feinstein, 1995; Ramsay, 1991).
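
To illustrate how conditional DIF follows from fitted logistic curves, the sketch below evaluates a logistic model in the matching score, group membership, and their interaction, and takes the focal-minus-reference difference at each score level. The coefficient names and values are hypothetical, and the group coding (1 = focal, 0 = reference) is an assumption made for the example; no model is actually fitted here.

```python
import math


def logistic_prob(x, g, b0, b_x, b_g, b_xg):
    """P(Y = 1) from a logistic curve in score x, group g, and their interaction."""
    logit = b0 + b_x * x + b_g * g + b_xg * x * g
    return 1.0 / (1.0 + math.exp(-logit))


def logistic_conditional_dif(scores, b0, b_x, b_g, b_xg):
    """Focal (g = 1) minus reference (g = 0) probability at each score level."""
    return [logistic_prob(x, 1, b0, b_x, b_g, b_xg)
            - logistic_prob(x, 0, b0, b_x, b_g, b_xg)
            for x in scores]


scores = list(range(0, 71, 10))
# Hypothetical coefficients: a group effect plus an interaction gives nonuniform DIF.
nonuniform = logistic_conditional_dif(scores, b0=-1.0, b_x=0.08, b_g=-0.9, b_xg=0.015)
# With both group terms set to zero the two curves coincide, so DIF is zero everywhere.
no_dif = logistic_conditional_dif(scores, b0=-1.0, b_x=0.08, b_g=0.0, b_xg=0.0)
```

The group term shifts the focal curve and the interaction term lets the size of the shift change with the score, so the two extra parameters are what the overall test with two degrees of freedom evaluates.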

The polynomial loglinear models assessed in this study were proposed by Hanson and 
Feinstein (1995). This estimation strategy is based on identifying DIF in terms of differences in 
four discrete frequency distributions of $X_m$: the two frequency distributions of the reference
group that gets Y correct and incorrect, and the two frequency distributions of the focal group 
that gets Y correct and incorrect (Appendix C). Polynomial loglinear models, one of many 
loglinear modeling proposals for assessing DIF, are iterative and more flexible versions of 
Mantel-Haenszel (Holland & Thayer, 1988), are more parsimonious than the “saturated” 
loglinear models described in Mellenbergh (1982), and have an observed score focus rather than 
other Rasch-focused loglinear models (Kelderman, 1984). Hanson and Feinstein provided 
demonstrations of the use of polynomial loglinear models for overall significance tests and for 
conditional DIF estimates. Conditional standard errors were not described. The Hanson and
Feinstein study demonstrated that loglinear models are more flexible and make fewer
impositions on the data than logit models (such as logistic regression models).

A final approach that is considered for assessing overall and conditional DIF is based on 
kernel smoothing (Ramsay, 1991). Kernel smoothing employs weighted averaging to reduce 
fluctuations in raw data estimates (Appendix D). This study employs kernel smoothing to 
separately smooth the raw focal and reference $E(Y_m)$’s, an approach that is routinely used at
ETS to assess conditional DIF and also to assess items’ nonparametric response curves. This
version of kernel smoothing differs from prior versions employed in studies of kernel smoothing
applications to DIF that are computationally intensive, nonparametric-IRT-based procedures
(Douglas, Stout, & DiBello, 1996; Gierl & Bolt, 2001; Lyu, Dorans, & Ramsay, 1995; Ramsay,
2000).
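
The weighted averaging behind kernel smoothing can be sketched as below for one group's raw conditional means. The Gaussian kernel, the single fixed bandwidth, and the weighting by the number of examinees at each score are assumptions made for illustration; they are not the Appendix D specification, and the data are hypothetical.

```python
import math


def kernel_smooth(scores, raw_means, counts, bandwidth):
    """Smooth raw conditional means with a count-weighted Gaussian kernel.

    Assumptions for illustration: Gaussian kernel, a single fixed bandwidth,
    and weights proportional to the number of examinees at each score.
    """
    smoothed = []
    for x0 in scores:
        weights = [n * math.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
                   for x, n in zip(scores, counts)]
        total = sum(weights)
        smoothed.append(sum(w * p for w, p in zip(weights, raw_means)) / total)
    return smoothed


# Hypothetical raw conditional means for one group (not the study's data).
scores = [0, 1, 2, 3, 4, 5]
raw_means = [0.10, 0.30, 0.20, 0.50, 0.45, 0.80]
counts = [5, 20, 40, 40, 20, 5]

smoothed = kernel_smooth(scores, raw_means, counts, bandwidth=1.0)
# Conditional DIF would then be the difference between the separately
# smoothed focal and reference curves.
```

Because each smoothed value borrows from neighboring score levels in proportion to how much data they hold, smoothing the two groups separately can pull their curves in different directions when the groups' score distributions differ.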

Example. An example is presented to illustrate the distinguishing features of the four 
DIF estimation strategies. This example is based on the population data of one of the DIF items 
featured in this simulation study: the Science1 item. This item was flagged as a conditional DIF
item favoring the male reference group (N = 34,336) as compared to the female focal group (N = 
18,560). More specific information about the DIF context of this item is described in this study’s 
section, Raw Population Data and Their Population Differential Item Functioning (DIF) 
Statistics. 

The standardized E-Dif values and overall significance tests based on the four DIF 
estimation strategies of interest are presented in Table 1. The standardized E-Dif values based on 
raw data, logistic regression, and loglinear models are identical when rounded to their first three 
decimal places (-0.140). The standardized E-Dif value based on kernel smoothing is somewhat 
different from those of the other three estimation strategies (-0.148). All four estimation 
strategies indicate statistically significant overall DIF. 

The conditional DIF and +/- 2 estimated standard error bands for the four estimation
strategies are presented in Figures 1 to 4. The figures suggest that the Science1 item’s DIF is
nonuniform (i.e., the level of DIF is not the same across the score levels of $X_m$). Specifically,
DIF is shown to be large and statistically significant for the low-to-middle scores of $X_m$ but
close to zero (i.e., no DIF) and possibly statistically insignificant for the higher scores of $X_m$.
These nonuniform, $X_m$-varying conditional DIF estimates are missed when the focus is only on
the overall standardized E-Dif values and significance tests (Table 1).

Table 1

Comparing the Four Differential Item Functioning (DIF) Estimation Strategies’ Overall DIF
Assessments in the Study’s Population Data (Science1 Item)

Method                 Standardized E-Dif    Significance test statistic
Raw data               -0.140                z = -31.03*
Logistic regression    -0.140                $\chi^2$ = 1,146.62* (df = 2)
Loglinear models       -0.140                $\chi^2$ = 1,167.89* (df = 5)
Kernel smoothing       -0.148                z = -33.78*

*p < .05.



Figure 1. Science1 item: Raw data for conditional differential item functioning (DIF) and
standard errors (SE).
Figures 1 to 4 illustrate how the four DIF estimation strategies differ: The conditional
DIF estimates based on the raw data exhibit large fluctuations and relatively wide standard error
bands (Figure 1), while the logistic regression method has narrow standard error bands and
conditional DIF estimates that disagree with the raw data’s no DIF suggestion at the highest $X_m$
scores (Figure 2 vs. Figure 1). The loglinear model (Figure 3) and kernel smoothing (Figure 4)
estimation strategies appear to reflect the trends in Figure 1’s raw data conditional DIF estimates
more closely than the logistic regression method, with the standard error bands based on the
loglinear model being wider than those of the kernel smoothing method at the lowest $X_m$ scores.



Figure 2. Science1 item: Logistic regression for conditional differential item functioning
(DIF) and standard errors (SE).

Figure 3. Science1 item: Loglinear models for conditional differential item functioning
(DIF) and standard errors (SE).

Figure 4. Science1 item: Kernel smoothing for conditional differential item functioning
(DIF) and standard errors (SE).
This Differential Item Functioning (DIF) Study

This DIF study is different from prior DIF studies in that the major focus is on the 
accuracy of estimation strategies’ conditional DIF and conditional standard error estimates, with 
somewhat less emphasis on the accuracy of their overall statistical significance tests and overall 
effect sizes (i.e., standardized E-Dif values; Dorans & Kulick, 1986). As implied in the reviews 
of the DIF estimation strategies of interest (raw data, logistic regression, loglinear models, and 
kernel smoothing), much of the prior research has not compared many of these estimation 
strategies directly to each other and with respect to this study’s conditional DIF focus. What 
studies have been done suggest the following findings from comparisons of the four DIF 
estimation strategies: 

• The estimation strategies may differ more with respect to their conditional DIF estimates 
than with respect to their ability to estimate the same summary DIF measure, the 
standardized E-Dif. This suggestion is based on prior studies that assessed the use of various 
modeling strategies for smoothing raw DIF estimates, which have shown that smoothing 
conditional DIF estimates and then aggregating these estimates into overall DIF measures 
does not improve overall DIF measures relative to simply using the raw data (Douglas et al., 
1996; Puhan, Moses, Yu, & Dorans, 2007). 

• An important issue in comparing the DIF estimation strategies is assessing them in terms of 
their tradeoff of flexibility to fit a range of conditional DIF curves versus statistical power to 
detect DIF. Specifically, the logistic regression strategy’s imposition of logistic functions 
onto the sample data is a less flexible and less data-adaptive estimation approach than 
loglinear models (Hanson & Feinstein, 1995), kernel smoothing (Ramsay, 1991), and raw 
data. Logistic regression’s reduced flexibility could result not only in reduced estimation
accuracy for certain DIF situations, but also in increased statistical power because its overall
chi-square tests are based on fewer degrees of freedom than those of the loglinear models
estimation strategy (Appendixes B and C) and perhaps because its use of simpler modeling
parameterizations produces smaller standard errors for conditional DIF estimates.

• The kernel smoothing estimation strategy has its own distinguishing features that need to be 
compared with those of the other strategies. The described example showed that the 
conditional DIF and standardized E-Dif based on kernel smoothing differed from those of the 
other estimation strategies. The use of kernel smoothing as an overall statistical significance
test is an additional interest, as this issue has received little attention in prior studies and has
not resulted in an extremely accurate significance test (Douglas et al., 1996).



Method 

The raw data, logistic regression, loglinear modeling, and kernel smoothing DIF 
estimation strategies were compared in several simulations. Populations for DIF items were 
obtained from large-volume test data, and the DIF statistics computed from the raw population
data were used as population DIF statistics. From these populations, sample datasets were 
randomly drawn at specific reference and focal group sample sizes. Conditional and overall DIF 
were assessed using each of the four strategies in each of the sample datasets. The accuracies of 
the estimation strategies were studied by averaging their results over 400 replications of sample 
datasets and comparing the averages to the population DIF statistics computed in the raw 
population data. 

Raw Population Data and Their Population Differential Item Functioning (DIF) Statistics 

The study used test data from two large-scale achievement tests as the populations. These 
populations are comprised of test data used to conduct actual DIF analyses, making these 
populations especially useful for realistic evaluations that are relevant for practice. Six 
conditional DIF items were found, three from a 69-item science test and three from an 80-item 
history test. The DIF was based on males and females, a comparison that resulted in large 
reference and focal populations. The science test data consisted of 52,896 examinees, with 
34,336 examinees in the reference group (i.e., male) and 18,560 examinees in the focal group 
(i.e., female). The history test data consisted of 325,250 examinees, with 147,737 examinees in 
the reference group (i.e., male) and 177,513 examinees in the focal group (i.e., female). 

Table 2 presents the population statistics of the six items, including their average item
scores, point-biserial correlations with the matching variable, and the standardized average
reference versus focal difference on the matching variable. Table 2 also shows the items’
standardized E-Dif values calculated from the raw population data (used as population DIF
statistics in this study). Table 2’s summary of the six items shows that these items vary in their
DIF-relevant characteristics, including different levels of reference versus focal abilities on the
matching variable (the science vs. history items), easier and more difficult studied items
(Science3 vs. the other five items), varied correlations with the matching variable, varied
magnitudes of DIF (Science1 and Science2 vs. Science3 and History2), and DIF situations where
the reference group is favored (Science1 and Science2) and other DIF situations where the focal
group is favored (Science3, History1, History2, and History3).

Sample Sizes 

Random samples of the reference and focal data were drawn from the population data in 
reference/focal sizes of 2,000/2,000, 2,000/700, 700/700, and 700/200. 

Simulations 

The simulations were conducted to assess the raw, logistic regression, loglinear models, 
and kernel smoothing DIF estimation strategies with respect to their estimation of six different 
items and four reference and focal sample size combinations. For each of the 6 x 4 = 24 
combinations of DIF item and sample size, 400 datasets were randomly drawn from the 
population data. In each of these sample datasets, the four DIF estimation strategies were used to 
estimate the conditional DIF in the raw population data across all M levels of matching variable
$X_m$ (1), to estimate the M conditional standard errors of the conditional DIF estimates (3), to
conduct significance tests of overall DIF (Appendixes A, B, C, and D), and to estimate the
overall standardized E-Dif measure (2) in the raw population data.

Table 2

Summary of the Raw Population Data for the Six Studied Items (Y)

                  Average item      Point-biserial       Average              Standardized
                  score on Y in     correlation          standardized         E-Dif of Y
                  the combined      between X and Y      difference on X      based on raw
Subject & item    focal and         in the combined      (focal-reference)    data
(Y)               reference data    focal and
                                    reference data
Science1          0.69              0.32                 -0.41                -0.14
Science2          0.75              0.42                 -0.41                -0.12
Science3          0.23              0.44                 -0.41                 0.07
History1          0.77              0.38                 -0.26                 0.10
History2          0.92              0.27                 -0.26                 0.08
History3          0.76              0.40                 -0.26                 0.10
The study also considered 24 additional no DIF conditions for the six items and four
sample sizes. For the no DIF conditions, the studied item’s conditional expected scores
computed in the combined population reference and focal data were used as population
parameters for randomly generating the reference and focal studied item responses in each of the
sample datasets. The data generation for the no DIF conditions is illustrated in the following three
bullets:

• For the Science1 item, the expected score of the combined population reference and
population focal data at $X_m = 5$ was 0.358. For the simulation of no DIF in the Science1
item, the Science1 item scores for the reference and focal data at $X_m = 5$ were created by
randomly drawing values of either 0 or 1, where the probability of drawing a score of 1 at
$X_m = 5$ was 0.358. The result of this generation of Science1 item scores was that the
expected (population) Science1 item score at $X_m = 5$ was the same (no DIF) in the reference
and focal sample data, 0.358.

• For the Science3 item, the expected score of the combined population reference and
population focal data at $X_m = 11$ was 0.054. For the simulation of no DIF in the Science3
item, the Science3 item scores for the reference and focal data at $X_m = 11$ were created by
randomly drawing values of either 0 or 1, where the probability of drawing a score of 1 at
$X_m = 11$ was 0.054. The result of this generation of Science3 item scores was that the
expected (population) Science3 item score at $X_m = 11$ was the same (no DIF) in the reference
and focal sample data, 0.054.

• For the History2 item, the expected score of the combined population reference and
population focal data at $X_m = 27$ was 0.849. For the simulation of no DIF in the History2
item, the History2 item scores for the reference and focal data at $X_m = 27$ were created by
randomly drawing values of either 0 or 1, where the probability of drawing a score of 1 at
$X_m = 27$ was 0.849. The result of this generation of History2 item scores was that the
expected (population) History2 item score at $X_m = 27$ was the same (no DIF) in the reference
and focal sample data, 0.849.
For all of the no DIF conditions, the scores for all $X_m$ values of all six items were generated in
the same manner as described in the previous three bullets. These scores resulted in
reference and focal data where the expected (population) DIF was zero for all $X_m$ values (1) and
also zero when aggregated across all $X_m$ values (2). This generation of no DIF made it possible
for the DIF strategies to be assessed in no DIF conditions that preserved the overall
characteristics of the studied items (i.e., difficulty, point-biserial correlations) and matching
variables (score ranges, overall reference and focal ability differences).
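
The no DIF data generation can be sketched as a Bernoulli draw that uses the same probability for both groups at a given score level. The function name, the seed, and the number of examinees placed at the score level are hypothetical; only the probability 0.358 at a matching score of 5 comes from the Science1 example in the text.

```python
import random


def generate_no_dif_scores(score_levels, common_probs, rng):
    """Draw 0/1 item scores using one probability per score level for both groups.

    score_levels: the matching score X_m of each sampled examinee.
    common_probs: dict mapping X_m to the combined-population expected item score.
    """
    return [1 if rng.random() < common_probs[x] else 0 for x in score_levels]


rng = random.Random(2024)            # fixed seed so the sketch is repeatable
common_probs = {5: 0.358}            # Science1 value at X_m = 5, from the text
ref_x = [5] * 20000                  # hypothetical reference examinees at X_m = 5
focal_x = [5] * 20000                # hypothetical focal examinees at X_m = 5

ref_y = generate_no_dif_scores(ref_x, common_probs, rng)
focal_y = generate_no_dif_scores(focal_x, common_probs, rng)
ref_mean = sum(ref_y) / len(ref_y)
focal_mean = sum(focal_y) / len(focal_y)
```

Both group means vary around the same expected value, so any difference between them in a sample reflects sampling error rather than DIF.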

For each of the 48 total conditions (4 sample sizes × 6 items × DIF vs. no DIF = 48), the
accuracies of the four DIF estimation strategies’ conditional DIF and conditional standard error 
estimates, overall significance tests, and standardized E-Dif measures were assessed as averaged 
across the 400 replicated datasets and compared with the values computed in the raw population 
data. 

Accuracy measures. To evaluate the accuracy of the conditional DIF estimates for each
of the study’s 48 conditions (six studied items, four sample size combinations, and DIF vs. no
DIF conditions), measures were computed from the mean squared error (MSE) calculated at
each of the M levels of the matching variable, $X_m$,

$$
\begin{aligned}
MSE_m &= \frac{1}{400} \sum_{replication} \left( \hat{\theta}_{m,replication} - \theta_m \right)^2 \\
&= \frac{1}{400} \sum_{replication} \left[ \left( \bar{\theta}_m - \theta_m \right)^2 + \left( \hat{\theta}_{m,replication} - \bar{\theta}_m \right)^2 \right] \\
&= Bias_m^2 + Variance_m, \qquad (4)
\end{aligned}
$$

where replication indicates one of the 400 random datasets drawn from one of the population
distributions at one of the four sample size combinations, $\hat{\theta}_{m,replication}$ is the conditional
DIF estimate in one of the 400 datasets at $X_m$, $\bar{\theta}_m$ is the average of the 400 sample datasets’
conditional DIF estimates at $X_m$, and $\theta_m$ is the conditional DIF estimate computed in the raw
population data at $X_m$.
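
The decomposition in (4) can be checked numerically with a short sketch. The function name and the five replicate estimates are hypothetical; the study used 400 replications per condition.

```python
def mse_decomposition(estimates, population_value):
    """Return (MSE, squared bias, variance) for one score level, as in (4)."""
    r = len(estimates)
    mean_est = sum(estimates) / r
    mse = sum((e - population_value) ** 2 for e in estimates) / r
    bias_sq = (mean_est - population_value) ** 2
    variance = sum((e - mean_est) ** 2 for e in estimates) / r
    return mse, bias_sq, variance


# Hypothetical conditional DIF estimates from five replications at one X_m.
estimates = [-0.12, -0.18, -0.15, -0.10, -0.20]
mse, bias_sq, variance = mse_decomposition(estimates, population_value=-0.14)
```

The cross-product term drops out because the deviations of the estimates from their own mean sum to zero, which is why the MSE equals the squared bias plus the variance when the variance uses the divisor 400 rather than 399.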
The square roots of the conditional squared bias and variance in (4) were taken
and averaged with respect to the M score levels of $X_m$ to form average absolute conditional bias
and average conditional standard deviations,

$$\text{Avg. Abs. Conditional Bias} = \frac{1}{M} \sum_m \sqrt{Bias_m^2}, \qquad (5)$$

$$\text{Avg. Conditional SD} = \frac{1}{M} \sum_m \sqrt{Variance_m} = \frac{1}{M} \sum_m SD_m. \qquad (6)$$



To assess the extent to which strategies’ estimated conditional standard errors
approximated their estimates’ actual variability (i.e., the $SD_m$’s in (6)), a measure similar to (5)
was used,

$$\text{Avg. Abs. Conditional SE Inaccuracy} = \frac{\frac{1}{M} \sum_m \sqrt{\left( SE_m - SD_m \right)^2}}{\text{Avg. Conditional SD}}, \qquad (7)$$

where $SE_m$ is the average of a DIF strategy's estimated conditional standard errors across the
400 replications of an item and sample size combination.
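
The summary measures in (5), (6), and (7) can be sketched as follows. The numerator of (7) is read here as the average absolute difference between the mean estimated standard error and the empirical standard deviation at each score level; that reading, the function names, and the three-level values are assumptions made for illustration.

```python
import math


def average_abs_bias(bias_sq_by_level):
    """Equation (5): mean over score levels of sqrt(Bias_m^2)."""
    return sum(math.sqrt(b) for b in bias_sq_by_level) / len(bias_sq_by_level)


def average_sd(variance_by_level):
    """Equation (6): mean over score levels of SD_m = sqrt(Variance_m)."""
    return sum(math.sqrt(v) for v in variance_by_level) / len(variance_by_level)


def average_abs_se_inaccuracy(mean_se_by_level, variance_by_level):
    """Equation (7), read here as mean |SE_m - SD_m| divided by the average SD.

    This reading of the numerator is an assumption made for the sketch.
    """
    sds = [math.sqrt(v) for v in variance_by_level]
    mean_abs_gap = sum(abs(se - sd) for se, sd in zip(mean_se_by_level, sds)) / len(sds)
    return mean_abs_gap / average_sd(variance_by_level)


# Hypothetical results at three score levels.
bias_sq = [0.0001, 0.0004, 0.0]
variance = [0.0016, 0.0009, 0.0025]
mean_se = [0.05, 0.03, 0.04]

avg_bias = average_abs_bias(bias_sq)
avg_sd = average_sd(variance)
se_inaccuracy = average_abs_se_inaccuracy(mean_se, variance)
```

Dividing by the average SD expresses the standard error inaccuracy relative to the size of the variability being estimated, so strategies with very different variability can be compared on one scale.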

Alternative summary measures to (5), (6), and (7) would be to average the $X_m$-level
squared differences or the signed differences rather than the absolute differences, and/or weight
the $X_m$-level results by a population distribution. The $X_m$-level averaging was done on the
absolute differences because it oriented the averaging directly on the conditional DIF and
standard error quantities of interest rather than on the squared values. The averaging of absolute
differences was desirable also because it produced summaries that were not influenced by the
cancellation of positive and negative differences. The nonweighting in (5), (6), and (7) was used
because, in practice, conditional DIF results would potentially be evaluated at score levels not
necessarily based on where the most data are found. Preliminary evaluations of the results
showed that the conclusions would not be dramatically altered by using alternative versions of
(5), (6), and (7), but they would also not be identical to the reported results. Plots of the
strategies’ estimation results were also created to supplement the summary measures. These plots
depicted the biases of the conditional DIF ($\bar{\theta}_m - \theta_m$), and the size and accuracies of the standard
error estimates ($SE_m$ vs. $SD_m$) for specific item and sample size combinations of interest.

To evaluate the DIF strategies’ accuracy in terms of the standardized E-Dif measure,
accuracy measures were created as the square roots of the squared bias (standardized E-Dif
absolute bias) and variance (standardized E-Dif SD) parts of its own MSE,

$$\text{Standardized E-Dif Absolute Bias} = \sqrt{\left( \bar{\theta} - \theta \right)^2} = \sqrt{Bias^2}, \qquad (8)$$

$$\text{Standardized E-Dif SD} = \sqrt{\frac{1}{400} \sum_{replication} \left( \hat{\theta}_{replication} - \bar{\theta} \right)^2} = \sqrt{Variance}, \qquad (9)$$

where $\hat{\theta}_{replication}$ is the estimated standardized E-Dif value in one of the 400 datasets, $\bar{\theta}$ is the
average of the 400 sample datasets’ standardized E-Dif values, and $\theta$ is the raw data
standardized E-Dif value computed in the population data.

The accuracy of the DIF estimation strategies’ overall statistical significance tests was 
also assessed. For this evaluation, a rate was calculated for how often each estimation strategy 
indicated that DIF was statistically significant across the 400 replications of an item and sample 
size condition. When the studied item responses were drawn from the actual male and female 
population data, these rates were power rates (i.e., the rate at which the DIF estimation strategies 
correctly indicated DIF when DIF was in the population). When studied item responses for the 
male and female samples were randomly generated from a common set of conditional expected 
scores, these rates were Type I error rates (i.e., the rate at which the DIF estimation strategies 
incorrectly indicated DIF when DIF was not in the population). The superior strategy in terms of 
statistical significance tests was the one that had the largest power rate while staying sufficiently 
close to the desired 0.05 Type I error rate, where sufficient was defined as within a range of 
0.025 to 0.075. This range is known as Bradley’s (1978) liberal criterion of robustness and is 
commonly used to evaluate statistical strategies’ Type I error rates (e.g., Keselman, Wilcox, 
Othman, & Fradette, 2002). 
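
The rejection rates and the robustness check described above can be sketched as follows. The function names and the list of p-values are hypothetical; the bounds follow the 0.025 to 0.075 range stated in the text for a nominal rate of 0.05.

```python
def rejection_rate(p_values, alpha=0.05):
    """Share of replications whose overall DIF test is significant at alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


def within_bradley_liberal(type1_rate, alpha=0.05):
    """Bradley's (1978) liberal criterion: rate within 0.5*alpha to 1.5*alpha."""
    return 0.5 * alpha <= type1_rate <= 1.5 * alpha


# Hypothetical p-values from 400 no-DIF replications: 22 fall below 0.05.
no_dif_p_values = [0.01] * 22 + [0.50] * 378
type1_rate = rejection_rate(no_dif_p_values)
acceptable = within_bradley_liberal(type1_rate)
```

Applied to replications drawn from the actual male and female populations, the same `rejection_rate` function would give a power rate instead of a Type I error rate.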
Results

Differential Item Functioning (DIF) Estimation Strategies’ Conditional DIF and Standard 
Error Results 

The results of DIF estimation strategies’ conditional DIF and standard error estimates are
summarized by studied item (measures are averaged across the 4 × 2 = 8 combinations of sample
size and DIF vs. no DIF; Table 3), by sample size (measures are averaged across the 6 × 2 = 12
combinations of studied item and DIF vs. no DIF; Table 4), and by DIF versus no DIF (measures
are averaged across the 6 × 4 = 24 combinations of studied item and sample size; Table 5). Each of
these tables compares the four estimation strategies in terms of the extent to which their
conditional DIF estimates systematically deviated from the population conditional DIF values
(average absolute conditional bias, or avg. abs. conditional bias), the variability of their
conditional DIF estimates (average conditional standard deviation, or avg. conditional SD), and 
the accuracy of their conditional standard errors (average absolute conditional standard error 
inaccuracy, or avg. abs. conditional SE inaccuracy). The values of absolute bias, variability, and 
standard error accuracy for specific items, sample sizes, and DIF condition are bolded to indicate 
the best DIF estimation strategy and underlined to indicate the worst DIF estimation strategy. 

The DIF estimation strategies produced mixed results in terms of their absolute 
conditional bias for the items, sample sizes, and DIF conditions. The raw data strategy had the 
smallest absolute conditional biases for the three science items, while the loglinear models
strategy had the smallest values for the History1 item and the logistic regression strategy had the
smallest values for the History2 and History3 items. The kernel smoothing strategy had the 
largest absolute conditional biases for four of the six studied items. In terms of sample sizes, the 
logistic regression strategy had the smallest absolute conditional bias for the smallest sample size 
condition considered (700/200), while the raw data strategy had the smallest absolute conditional 
biases and the kernel smoothing strategy had the largest absolute conditional biases for the three 
larger sample size conditions (700/700, 2,000/700, and 2,000/2,000). For the no DIF conditions, 
the logistic regression strategy had the smallest absolute conditional bias and the raw data 
strategy had the largest absolute conditional bias. For the DIF conditions, the raw data strategy 
had the smallest absolute conditional bias while the kernel smoothing strategy had the largest 
absolute conditional bias. 

Table 3 

The Four Differential Item Functioning (DIF) Estimation Strategies’ Conditional DIF and Standard Error (SE) Results by Item 

              Avg. abs. conditional bias                  Avg. conditional SD                         Avg. abs. conditional SE inaccuracy
Items         Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel
Science1      0.013      0.023      0.021      0.027      0.214      0.034      0.074      0.042      0.413      0.062      0.093      0.143
Science2      0.008      0.015      0.019      0.026      0.160      0.025      0.062      0.032      0.405      0.048      0.086      0.131
Science3      0.011      0.023      0.019      0.024      0.188      0.033      0.070      0.043      0.379      0.062      0.106      0.174
History1      0.023      0.017      0.016      0.018      0.205      0.032      0.083      0.041      0.451      0.041      0.084      0.166
History2      0.020      0.010      0.013      0.013      0.177      0.036      0.079      0.033      0.545      0.088      0.082      0.201
History3      0.014      0.011      0.014      0.017      0.196      0.031      0.076      0.041      0.428      0.065      0.082      0.161

Note. The best strategy in terms of absolute bias, standard deviation, and standard error inaccuracy for each item is bolded while the 
worst strategy is underlined. Avg. abs. = average absolute, SD = standard deviation, SE = standard error. 



Table 4 

The Four Differential Item Functioning (DIF) Estimation Strategies’ Conditional DIF and Standard Error (SE) Results by Sample Size 

                    Avg. abs. conditional bias                  Avg. conditional SD                         Avg. abs. conditional SE inaccuracy
Sample sizes (R/F)  Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel
700/200             0.023      0.017      0.024      0.023      0.258      0.049      0.114      0.057      0.533      0.049      0.113      0.268
700/700             0.015      0.016      0.014      0.021      0.201      0.032      0.075      0.039      0.445      0.084      0.073      0.145
2,000/700           0.012      0.016      0.017      0.021      0.168      0.027      0.063      0.034      0.410      0.049      0.094      0.145
2,000/2,000         0.010      0.016      0.013      0.019      0.133      0.019      0.044      0.025      0.360      0.062      0.076      0.092

Note. The best strategy in terms of absolute bias, standard deviation, and standard error inaccuracy for each sample size is bolded while 
the worst strategy is underlined. Avg. abs. = average absolute, R/F = reference/focal, SD = standard deviation, SE = standard error. 




Table 5 

The Four Differential Item Functioning (DIF) Estimation Strategies’ Conditional DIF and Standard Error (SE) Results by DIF/No DIF Conditions 

              Avg. abs. conditional bias                  Avg. conditional SD                         Avg. abs. conditional SE inaccuracy
DIF           Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel
No            0.015      0.004      0.009      0.011      0.191      0.032      0.074      0.039      0.440      0.048      0.091      0.164
Yes           0.015      0.029      0.025      0.031      0.189      0.032      0.074      0.039      0.434      0.074      0.087      0.160



Note. The best strategy in terms of absolute bias, standard deviation, and standard error inaccuracy for the DIF conditions is bolded 
while the worst strategy is underlined. Avg. abs. = average absolute, SD = standard deviation, SE = standard error. 




The four DIF estimation strategies were fairly consistent in terms of the variability of 
their conditional DIF estimates (average conditional standard deviation) across the items (Table 
3), sample sizes (Table 4), and DIF versus no DIF conditions (Table 5). The general result was 
that the most-to-least variable conditional DIF estimates were those based on raw data, loglinear 
models, kernel smoothing, and logistic regression. The raw data estimates were more than twice 
as variable as the loglinear models’ estimates (the second most variable), which in turn were 
usually more than twice as variable as the logistic regression estimates (the least variable). 

The four DIF estimation strategies were fairly consistent in terms of the accuracy of their 
conditional standard error estimates (average absolute conditional standard error inaccuracy) 
across the items (Table 3), sample sizes (Table 4), and DIF versus no DIF conditions (Table 5). 
Generally, the most-to-least accurate conditional standard error estimates were those based on 
logistic regression, loglinear models, kernel smoothing, and raw data. 

Plots to further assess the conditional DIF and standard error results. Plots were 
used to examine the estimation strategies’ bias and variability results in detail for a limited 
number of this study’s conditions. These plots focused on the results obtained for the Science1 
item, which are representative of the plots produced for the other five items. To 
consider the biases of the DIF strategies’ conditional DIF estimates in the no DIF condition, 
Figures 5 and 6 plot the strategies’ conditional biases where the studied item had no DIF in the 
population and where the reference and focal datasets were drawn at sample sizes of 700/200 
(Figure 5) and 2,000/2,000 (Figure 6). For the small sample size condition 
shown in Figure 5, the raw data and loglinear models estimation strategies exhibit their largest 
biases at the highest and lowest scores of $X_m$, while the kernel smoothing estimation strategy 
exhibits small but consistently negative biases throughout many of the low to middle scores of 
$X_m$. The strategies’ conditional biases are generally small when based on large sample sizes 
(Figure 6), though the raw data biases fluctuate at the high and low scores of $X_m$, the 
loglinear models’ biases are largest at the lowest scores of $X_m$, and the kernel smoothing biases 
are small but consistently negative for many of the low to middle scores of $X_m$. 

Figure 5. Science1 item: Differential item functioning (DIF) estimation strategies’ 
conditional biases, population DIF = no, reference/focal sample sizes = 700/200. 

Figure 6. Science1 item: Differential item functioning (DIF) estimation strategies’ 
conditional biases, population DIF = no, reference/focal sample sizes = 2,000/2,000. 

To consider the estimation strategies’ biases in conditions where the studied item had 
DIF in the population, Figures 7 and 8 plot the strategies’ conditional biases where the Science1 
item had DIF in the population and where the reference and focal datasets were drawn at sample 
sizes of 700/200 (Figure 7) and 2,000/2,000 (Figure 8). The bias results in 
Figures 7 and 8 are very erratic, due in large part to the fluctuations in the population conditional 
DIF (Figure 1). Close inspection of the figures shows that the raw 
data estimates are generally less biased than those of the other three DIF estimation strategies, 
particularly at the highest and lowest scores of $X_m$. The loglinear models’ estimation strategy 
produced conditional DIF estimates that were more biased than those of the logistic regression 
estimation strategy for the small sample size condition (Figure 7) and less biased than those 
of the logistic regression estimation strategy for the large sample size condition (Figure 8). The 
kernel smoothing biases deviated from the zero line to a larger extent than the biases based on 
the raw data, loglinear models, and logistic regression estimation strategies. 

Figure 7. Science1 item: Differential item functioning (DIF) estimation strategies’ 
conditional biases, population DIF = yes, reference/focal sample sizes = 700/200. 

Figure 8. Science1 item: Differential item functioning (DIF) estimation strategies’ 
conditional biases, population DIF = yes, reference/focal sample sizes = 2,000/2,000. 



To evaluate the DIF strategies’ variabilities and the accuracies of their estimated standard 
errors, Figures 9 to 12 plot the strategies’ conditional $SD_m$ and $SE_m$ values obtained from the 
Science1 item based on reference/focal sample sizes of 700/200 and 2,000/2,000. The major 
results shown in these plots are that the standard error estimates get smaller and more accurate 
with larger sample sizes. The standard error estimates based on 700/200 sample sizes using the 
raw data strategy (Figure 9) are particularly inaccurate in that they underestimate actual 
variability (i.e., the $SD_m$’s in (6)) for the majority of the $X_m$ scores. The standard error estimates 
of the logistic regression (Figure 10), loglinear models (Figure 11), and kernel smoothing (Figure 
12) estimation strategies are smaller, smoother, and more accurate than those based on raw data. 

Figure 9. Science1 item: Raw data for conditional standard error (SE) estimates. 

Figure 10. Science1 item: Logistic regression for conditional standard error (SE) estimates. 

Figure 11. Science1 item: Loglinear models for conditional standard error (SE) estimates. 

Figure 12. Science1 item: Kernel smoothing for conditional standard error (SE) estimates. 

Differential Item Functioning (DIF) Estimation Strategies and Standardized E-Dif Estimation 

The raw data, logistic regression, loglinear models, and kernel smoothing DIF estimation 
strategies’ absolute biases and standard deviations in estimating the standardized E-Dif 
measure are shown for each item (Table 6), sample size combination (Table 7), and DIF versus 
no DIF condition (Table 8). The values of absolute bias and variability are bolded to indicate 
the best DIF estimation strategy and underlined to indicate the worst DIF strategy. In terms of 
absolute bias, the results show small (0.001) and almost identical absolute biases in the 
standardized E-Dif values based on raw data, logistic regression, and loglinear models, and 
larger (greater than 0.010) absolute bias values in the standardized E-Dif values based on 
kernel smoothing. In terms of standard deviations, the standardized E-Dif values based on raw 
data exhibited slightly larger (by at most .002) variability than those based on logistic 
regression, loglinear models, and kernel smoothing. 

Table 6 

The Four Differential Item Functioning (DIF) Estimation Strategies’ Accuracies for the Standardized E-Dif by Item 

              Standardized E-Dif absolute bias            Standardized E-Dif SD
Items         Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel
Science1      0.001      0.001      0.001      0.015      0.026      0.025      0.025      0.025
Science2      0.001      0.001      0.001      0.021      0.023      0.023      0.023      0.023
Science3      0.001      0.002      0.001      0.011      0.019      0.018      0.018      0.019
History1      0.001      0.001      0.001      0.008      0.022      0.021      0.021      0.021
History2      0.000      0.000      0.000      0.004      0.015      0.015      0.015      0.014
History3      0.001      0.001      0.001      0.009      0.022      0.021      0.021      0.021



Note. The best strategy in terms of absolute bias and standard deviation for each item is bolded 
while the worst strategy is underlined. SD = standard deviation. 



Differential Item Functioning (DIF) Strategies’ Type I Error and Power Rates 

To evaluate the four DIF estimation strategies in terms of the accuracies of their overall 
statistical significance tests, Table 9 presents their Type I error (no DIF) and power (DIF) rates 
for the six considered items and Table 10 presents their Type I error and power rates for the 
four reference/focal sample sizes. In terms of Type I error, the raw data, logistic regression, and 
loglinear models estimation strategies were robust with respect to the 0.025 to 0.075 criterion 
range, while the kernel smoothing estimation strategy produced consistently inflated Type I error 
rates. The estimation strategies could generally be ordered from most to least powerful as kernel 
smoothing, logistic regression, raw data, and loglinear models. The kernel smoothing estimation 
strategy’s high power rates are not useful due to its inability to sufficiently control Type I error. 
The loglinear models’ estimation strategy had power levels that suffered most in the smallest 
sample size condition (700/200) and had power levels that were very similar to those of the 
logistic regression and raw data strategies with the larger sample size conditions. 
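
The Type I error and power rates discussed above are rejection proportions across replications, checked against the 0.025 to 0.075 robustness range. The sketch below shows that bookkeeping; the nominal level of 0.05 and the function names are assumptions made for illustration and are not taken from the study's Method section.

```python
def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications whose DIF test rejects at level alpha.

    With a no-DIF population this is an empirical Type I error rate;
    with a DIF population it is an empirical power rate.
    (alpha = 0.05 is an assumed nominal level.)
    """
    return sum(p < alpha for p in p_values) / len(p_values)


def is_robust(type1_rate, lower=0.025, upper=0.075):
    """Check a Type I error rate against the 0.025 to 0.075 criterion range."""
    return lower <= type1_rate <= upper


# Type I error rates reported for the Science1 item in Table 9.
print(is_robust(0.056))  # raw data
print(is_robust(0.119))  # kernel smoothing
```

Applied to Table 9, the raw data rate of 0.056 for Science1 falls inside the range, while the kernel smoothing rate of 0.119 falls outside it.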



Table 7 

The Four Differential Item Functioning (DIF) Estimation Strategies’ Accuracies for the Standardized E-Dif by Sample Size 

                    Standardized E-Dif absolute bias            Standardized E-Dif SD
Sample sizes (R/F)  Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel
700/200             0.001      0.001      0.001      0.012      0.033      0.031      0.031      0.032
700/700             0.000      0.001      0.001      0.012      0.022      0.021      0.021      0.021
2,000/700           0.001      0.001      0.001      0.010      0.017      0.017      0.017      0.017
2,000/2,000         0.001      0.001      0.001      0.010      0.013      0.012      0.012      0.012

Note. The best strategy in terms of absolute bias and standard deviation for each sample size is 
bolded while the worst strategy is underlined. R/F = reference/focal; SD = standard deviation. 



Table 8 

The Four Differential Item Functioning (DIF) Estimation Strategies’ Accuracies for the Standardized E-Dif by DIF/No DIF 

              Standardized E-Dif absolute bias            Standardized E-Dif SD
DIF           Raw        Logistic   Loglinear  Kernel     Raw        Logistic   Loglinear  Kernel
No            0.001      0.001      0.001      0.012      0.021      0.020      0.020      0.020
Yes           0.001      0.001      0.001      0.011      0.021      0.021      0.021      0.021



Note. The best strategy in terms of absolute bias and standard deviation for each DIF condition is 
bolded while the worst strategy is underlined. SD = standard deviation. 

Table 9 

The Four Differential Item Functioning (DIF) Estimation Strategies’ Type I Error and Power Rates by Item 

DIF                 Items         Raw        Logistic   Loglinear  Kernel
No (Type I error)   Science1      0.056      0.057      0.069      0.119^a
                    Science2      0.037      0.053      0.045      0.193^a
                    Science3      0.033      0.054      0.043      0.114^a
                    History1      0.046      0.054      0.061      0.083^a
                    History2      0.043      0.056      0.069      0.078^a
                    History3      0.042      0.055      0.057      0.084^a
Yes (Power)         Science1      0.985      0.981      0.968      0.998
                    Science2      0.959      0.961      0.930      0.996
                    Science3      0.901      0.929      0.898      0.893
                    History1      0.959      0.951      0.923      0.969
                    History2      0.978      0.984      0.961      0.992
                    History3      0.945      0.949      0.918      0.960

Note. The most powerful strategy’s power rate is bolded while the least powerful strategy’s 
power rate is underlined. 
^a Nonrobust Type I error rates that are outside the 0.025 to 0.075 range. 



Table 10 

The Four Differential Item Functioning (DIF) Estimation Strategies’ Type I Error and Power Rates by Sample Size 

DIF                 Sample sizes (R/F)  Raw        Logistic   Loglinear  Kernel
No (Type I error)   700/200             0.052      0.060      0.072      0.103^a
                    700/700             0.046      0.053      0.048      0.093^a
                    2,000/700           0.034      0.047      0.055      0.107^a
                    2,000/2,000         0.040      0.059      0.054      0.144^a
Yes (Power)         700/200             0.829      0.843      0.750      0.888
                    700/700             0.990      0.995      0.984      0.985
                    2,000/700           0.999      0.999      0.998      0.998
                    2,000/2,000         1.000      1.000      1.000      1.000

Note. The most powerful strategy’s power rate is bolded while the least powerful strategy’s 
power rate is underlined. R/F = reference/focal. 
^a Nonrobust Type I error rates that are outside the 0.025 to 0.075 range. 

Discussion 

The perspective of this study is that conditional DIF assessments are useful for evaluating 
an item’s DIF at a more detailed level than summary significance tests and effect sizes. This 
more detailed level can be important when summary DIF assessments oversummarize an item’s 
extent of DIF or summarize DIF when the summary is not of direct interest. The focus of the 
study was on evaluating the accuracy of four estimation strategies with respect to their 
conditional DIF estimates, with a secondary focus on these estimation strategies’ accuracies in 
estimating a common DIF effect size and their statistical significance tests. 

The overall results suggested that the logistic regression and loglinear models’ strategies 
were the most and second most recommended of the four evaluated DIF estimation strategies. 
The logistic regression estimation strategy was especially useful for estimating conditional DIF 
in small sample sizes and for a powerful statistical significance test of overall DIF. The loglinear 
models’ estimation strategy could approximate the conditional DIF in the population better than 
the logistic regression estimation strategy when the population’s conditional DIF was complex; 
however, it required large sample sizes for its flexibility to outweigh its relatively large standard 
errors and its reduced statistical power. The loglinear models’ estimation strategy offers a wider 
range of parameterizations than logistic regression (Appendix C): increasing the number 
of parameters in the loglinear models beyond what was used in this study can approximate data 
even more closely, while decreasing the number of parameters can reduce standard errors and 
perhaps increase statistical power. The decision process for selecting appropriate 
parameterizations in the loglinear models’ strategy can be very extensive (e.g., Hanson & 
Feinstein, 1995). The raw data strategy produced conditional DIF estimates that were relatively 
unbiased with respect to the population’s conditional DIF, but also had high levels of variability 
that cause conditional DIF assessments to elude interpretation for all but the largest sample sizes. 
The raw data, logistic regression, and loglinear models strategies estimated the standardized E-Dif measure 
of overall DIF with almost identical levels of accuracy. 

The performance of kernel smoothing made it the least desirable of the four considered 
DIF estimation strategies. It produced the most biased conditional DIF estimates of the four 
considered estimation strategies, had a significance test with an inflated Type I error rate, and 
was the only strategy with bias levels large enough to reduce the accuracy of the overall 
standardized E-Dif measure to levels of practical concern. The source of kernel smoothing’s 
inaccuracy is that it smoothes the $E(Y_m)$’s separately for the reference and focal groups, meaning 
that the groups’ smoothing parameters and the extent to which each of the M levels of $E(Y_m)$ is 
weighted in its weighted averaging process are a direct function of the groups’ overall and 
conditional sample sizes (Appendix D). When the groups differ in their overall ability, the 
$E(Y_m)$’s that are closely fit and strongly smoothed are different across the groups, creating 
inaccuracy in the conditional DIF estimates that inflates bias and Type I error rates. The effect of 
reference and focal group differences on the accuracy of kernel smoothing can be observed in the 
higher Type I error rates, conditional biases, and overall biases of the science items than the 
history items (Tables 3, 6, and 9), as the science items’ data exhibited larger reference and focal 
differences than the history items’ data (Table 2). 

Some follow-up efforts were made to try to improve the application of kernel smoothing 
in DIF assessments; one involved smoothing the raw conditional DIF estimates and another used 
a single weighting function to smooth both the reference and focal $E(Y_m)$’s. These efforts did 
not improve kernel smoothing beyond the version assessed in this study and even created 
additional inaccuracies that would be difficult to address (such as how to deal with one 
group’s missing data at an $X_m$ score). 

Future Directions 

Some issues not considered in this study could be the basis of future studies. The current 
study compared the DIF strategies under simple conditions where all of the items making up the 
$X_m$ score could be assumed to be non-DIF items. Future studies could evaluate the performance 
of the DIF estimation strategies when used with all items on the test making up $X_m$ (including 
Y) or when used with a data-based purification approach where all of the items on the test are 
evaluated for DIF and then the DIF items are excluded from $X_m$ when evaluating Y. Wider 
ranges of reference and focal group sample sizes could also be considered. 

An important extension of this investigation is to the evaluation of conditional DIF in 
polytomous items. The features of polytomous items would likely accentuate the differences 
between the loglinear models and logistic regression strategies. The loglinear models’ strategy 
would require several parameters to model the frequency distributions of each possible score on 
the studied item, probably reducing its statistical power and making model convergence less 
likely for small and moderate sample sizes. The unconstrained cumulative logits version of 
logistic regression has been demonstrated to have an accurate Type I error and high power as an 
overall significance test (Kristjansson et al., 2005), implying that its conditional DIF estimates 
would be most recommended. 

One DIF situation that could form an important follow-up study is a nonuniform DIF 
situation where the conditional DIF crosses to such an extent that the overall standardized E-Dif 
is close to zero. It may not be likely to find such a situation in practice, and even if found, this 
situation might be more likely explained by sampling variability than by substantive explanation. 
However, an extreme crossing DIF situation could be an important basis for studying the 
differences among the four DIF estimation strategies’ significance tests and null hypotheses. 
Specifically, the logistic regression and loglinear models’ strategies explicitly incorporate 
nonuniform DIF into their test statistics, perhaps making them more likely to detect crossing DIF 
than the raw data and kernel smoothed standardized strategies that focus on testing the 
standardized E-Dif. 

Some readers might be more interested in assessing DIF that is defined in terms of an 
expected true score matching variable (Shealy & Stout, 1993) than in terms of an observed score 
matching variable (1). While the SIBTEST approach to DIF is different from that considered in 
this study, the logistic regression and loglinear models’ estimation strategies have potential to 
work within and improve the SIBTEST procedure. Moses and Miao (2007) have shown that the 
use of loglinear models for estimating conditional DIF rather than raw data provides stability that 
allows the SIBTEST regression correction to work more closely to how it is intended to work. 
The use of loglinear models, and potentially logistic regression models, also avoids and possibly 
improves on the use of data exclusion strategies that have been advocated for the SIBTEST 
procedure (Shealy & Stout, 1993). 

A final discussion point is how the DIF criteria used in this study affected how well the 
DIF strategies performed. As stated throughout this study’s Method section, the DIF criteria 
chosen in this study were the DIF values computed from large populations of raw test data. 
Reviewers of this study have expressed concerns that this study’s populations of raw test data 
may have advantaged some of the considered strategies (i.e., raw data, loglinear models) over 
others (i.e., logistic regression). These reviewer concerns can be informed by an awareness that 
comparative studies of DIF methods always require a choice of how the DIF criteria and 
populations are defined. In prior DIF studies, DIF methods have been compared based on criteria 
and populations ranging from actual test data (Dorans & Holland, 1993; Hanson & Feinstein, 
1995; Lyu et al., 1995; Miller & Spray, 1993; Moses & Miao, 2007; Puhan et al., 2007) to data 
that have been simulated with degrees of nonuniform DIF and with presumed relationships 
between observed scores and latent variables (Douglas et al., 1996; Kristjansson et al., 2005; 
Roussos & Stout, 1996; Shealy & Stout, 1993; Swaminathan & Rogers, 1990). 

Because a choice is required for how criteria and populations are defined in DIF studies, 
justifications of these choices can be useful for interpreting DIF studies, their motivations, and 
their results. The justifications for the current study’s use of DIF values computed from large 
samples of raw test data as DIF criteria are that 1) large sample DIF criteria are realistic and 
therefore relevant for practice (as stated in this study’s Method section), and 2) all four of the 
considered DIF strategies have been recommended and used to estimate DIF in actual test data 
but have not been extensively compared (as stated in this study’s introduction). Additional 
investigations could be undertaken to address concerns that one or more of this study’s 
considered strategies was disadvantaged by this study’s use of realistic DIF criteria. The 
additional investigations could focus on comparing DIF estimation strategies with respect to 
artificial criteria that directly cater to strategies such as logistic regression (i.e., logistic item 
response functions rather than observed item response functions). 

Appendix A 

Differential Item Functioning (DIF) Estimates Using Raw Data 

The reference and focal expected scores of (1) and (2) can be estimated as the sample 
means from the raw data 
The standard error of (2) can be estimated as, 
(Dorans & Holland, 1993, p. 50). The division of (A3) into (2) has been promoted as a z-test of 
DIF in (2) (e.g., the z-test of the SIBTEST version of the standardized E-Dif is described in 
Shealy & Stout, 1993, p. 169). 
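
The raw data strategy can be illustrated with a minimal sketch. It assumes that conditional DIF is the focal group's mean item score minus the reference group's mean item score at each matching score, and that the standardized E-Dif weights these differences by the focal group's conditional sample sizes, as in the usual standardization approach (Dorans & Kulick, 1986). The function names, the data layout, and the choice to skip score levels where either group has no examinees are all assumptions made for illustration.

```python
def raw_conditional_dif(ref, focal):
    """Conditional DIF from raw data at each matching score.

    ref, focal: dicts mapping a matching score to a list of 0/1 item scores.
    Returns focal mean minus reference mean (assumed sign convention) for
    every score level at which both groups have at least one examinee.
    """
    dif = {}
    for m in sorted(set(ref) & set(focal)):
        if ref[m] and focal[m]:
            dif[m] = sum(focal[m]) / len(focal[m]) - sum(ref[m]) / len(ref[m])
    return dif


def standardized_e_dif(ref, focal):
    """Focal-weighted average of the conditional DIF values (assumed weights)."""
    dif = raw_conditional_dif(ref, focal)
    n_focal = sum(len(focal[m]) for m in dif)
    return sum(len(focal[m]) * d for m, d in dif.items()) / n_focal
```

With sparse data at the extreme scores, each conditional mean rests on very few examinees, which is consistent with the large variability reported for the raw data strategy in Tables 3 to 5.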

Appendix B 

Differential Item Functioning (DIF) Estimates Using Logistic Regression 

The application of logistic regression procedures to DIF assessment (French & Miller, 
1996; Jodoin & Gierl, 2001; Kristjansson et al., 2005; Swaminathan & Rogers, 1990) involves 
predicting the probability of a correct response (= 1) on dichotomously scored Y based on total 
score, $X_m$, and group membership. Logistic models of the separate reference and focal groups’ 
predicted Y’s can be estimated and directly used in (1) and (2) as the $E(Y)$’s, 
The $\beta$ terms in the models are estimated by maximum likelihood. The rightmost expressions of 
(B1) are matrix expressions helpful for additional derivations, where $\boldsymbol{\beta}^t$ is the transposed row 
vector of $\beta_0$ and $\beta_1$ terms, $(\beta_0, \beta_1)$, and $D_m$ is the mth 2-by-1 design matrix containing 1 and $X_m$. 
Estimates of the variances of the $E(Y)$’s for (3) can be computed from (B1) based on 
differentiating the functions and applying the delta method. When $P(Y_{Rm} = 1 \mid X_m)$ is used as 
$E(Y_{Rm})$, 
where the 2-by-2 variance-covariance matrix $\mathrm{Var}(\boldsymbol{\beta}_R)$ is the negative inverse of the second 
derivatives of the $P(Y_{Rm} = 1 \mid X_m)$ model’s loglikelihood function with respect to the model’s 
parameters, $\boldsymbol{\beta}_R$, when the maximum likelihood algorithm converges (Rao, 1966). The estimation 
of $\mathrm{Var}(E(Y_{Fm}))$ is similar. 
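
The delta-method step can be sketched as follows. The sketch uses the standard result that the derivative of a logistic probability with respect to its linear predictor is P(1 − P), so the variance of the fitted probability is [P(1 − P)]² times the quadratic form of the design vector (1, X_m) with the parameter covariance matrix. The function names and the nested-list matrix layout are assumptions for illustration, and the notation may differ from that of the paper's (B2).

```python
import math


def logistic_prob(beta0, beta1, x):
    """Fitted probability of a correct response at matching score x."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))


def delta_method_variance(beta0, beta1, cov, x):
    """Delta-method variance of the fitted probability at score x.

    cov is the 2-by-2 variance-covariance matrix of (beta0, beta1),
    given as nested lists (assumed layout).
    """
    p = logistic_prob(beta0, beta1, x)
    d = (1.0, x)  # design vector for the intercept and the matching score
    quad_form = sum(d[i] * cov[i][j] * d[j] for i in range(2) for j in range(2))
    return (p * (1.0 - p)) ** 2 * quad_form
```

Because the variance depends on only two estimated parameters, the resulting standard errors vary smoothly across scores, in line with the smooth logistic regression curves described for Figure 10.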

The logistic regression’s overall significance test is based on modeling the probability of 
a correct response (=1) on Y using both the reference and focal data in overall models with total 
score $X_m$, a dichotomously coded group membership variable, $G_m$, and the interaction of group 
membership and $X_m$, $X_m G_m$. One model allows for DIF by expressing the separate reference 
and focal models in (B1) in an overall model, 

$$P(Y_m = 1 \mid X_m, G_m, X_m G_m) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_m - \beta_2 G_m - \beta_3 X_m G_m)} \qquad \text{(B3)}$$

Another model constrains Y’s DIF to be zero in the reference and focal data, 

$$P(Y_m = 1 \mid X_m) = \frac{1}{1 + \exp(-\beta_0 - \beta_1 X_m)} \qquad \text{(B4)}$$

Model (B3) is a nonuniform DIF model that models Y based partly on constant reference and 
focal group differences ($\beta_2 G_m$) across $X_m$ and partly on reference and focal group 
differences that are allowed to vary with $X_m$ ($\beta_3 X_m G_m$). The logistic framework provides its 
own significance test for nonuniform DIF using the likelihood ratio test comparing models 
(B3) and (B4), 

$$\chi^2 = -2\left(\ln L(M_{B4}) - \ln L(M_{B3})\right) \qquad \text{(B5)}$$

where $\ln L(M_{B4})$ is the maximized loglikelihood for model (B4), 

$$\ln L(M_{B4}) = \sum_{m} \left[ n_{R+F,m,1} \ln P(Y_m = 1 \mid X_m) + n_{R+F,m,0} \ln P(Y_m = 0 \mid X_m) \right] \qquad \text{(B6)}$$

and $n_{R+F,m,1}$ and $n_{R+F,m,0}$ are the numbers of reference and focal examinees at score $X_m$ that 
obtain 1 and 0 on Y, respectively. $\ln L(M_{B3})$ is defined similarly. The statistic in (B5) is chi-square 
distributed with degrees of freedom equal to the difference in the degrees of freedom for 
models (B3) and (B4), or 2. 
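
The likelihood ratio test in (B5) and (B6) can be sketched in a few lines once the two models have been fit. The sketch assumes that the fitted probabilities are already available from some maximum likelihood routine and uses the fact that the upper-tail probability of a chi-square variable with 2 degrees of freedom is exp(−χ²/2). The function names are hypothetical.

```python
import math


def binomial_loglik(n_correct, n_incorrect, probs):
    """Loglikelihood in the form of (B6).

    n_correct[m], n_incorrect[m]: counts of examinees at score level m who
    obtain 1 and 0 on the studied item; probs[m]: the model's fitted
    probability of a correct response, assumed to lie strictly in (0, 1).
    """
    return sum(
        n1 * math.log(p) + n0 * math.log(1.0 - p)
        for n1, n0, p in zip(n_correct, n_incorrect, probs)
    )


def likelihood_ratio_test_df2(loglik_no_dif, loglik_dif):
    """Statistic in the form of (B5) and its p-value for 2 degrees of freedom."""
    chi_square = -2.0 * (loglik_no_dif - loglik_dif)
    p_value = math.exp(-chi_square / 2.0)  # exact chi-square tail for df = 2
    return chi_square, p_value
```

For example, loglikelihoods of −105.0 for the no-DIF model and −100.0 for the DIF model give a statistic of 10.0 and a p-value of about 0.0067.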

Appendix C 

Differential Item Functioning (DIF) Estimates Using Loglinear Models 

Loglinear models are used to separately estimate the frequency distributions of the DIF 
matching variable $X_m$ for each response category of Y. For the reference examinees who get Y 
correct (=1), the frequency distribution of $X_m$ can be modeled as 

$$\ln(s_{RmY=1}) = \beta_0 + \sum_{v=1}^{V} \beta_v X_m^v \tag{C1}$$

where $s_{RmY=1}$ is the expected (not actual) frequency of reference examinees who get Y correct and 
obtain score $X_m$, and the $\beta$ terms are estimated using maximum likelihood (Holland & Thayer, 
2000). $V$ is chosen by the modeler and must be less than the total number of scores on $X_m$, 
$M$. The maximum likelihood estimation of model (C1) produces a smoothed frequency 
distribution $s_{RmY=1}$, where the first $V$ moments (mean, variance, skewness, etc.) match those of 
the observed frequency distribution, $n_{RmY=1}$. $V$ is set at 4 for all models and conditions of this 
study. 
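
The moment-matching property of (C1) can be checked numerically with a short Python sketch. It is an illustration under stated assumptions, not the study's implementation: the observed frequencies are invented, the function names are hypothetical, and the model is fit as a Poisson loglinear model in powers of the standardized score. That polynomial spans the same space as the powers of $X_m$ in (C1), and a Poisson fit with an intercept reproduces the multinomial fitted frequencies, so the smoothed frequencies are the same even though the coefficient values differ from the $\beta$'s of (C1).

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [A[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def smooth_loglinear(scores, counts, V=4, iters=50):
    """Return smoothed frequencies from a degree-V polynomial loglinear model."""
    N = sum(counts)
    mean = sum(x * n for x, n in zip(scores, counts)) / N
    sd = math.sqrt(sum(n * (x - mean) ** 2 for x, n in zip(scores, counts)) / N)
    k = V + 1
    # Powers 0..V of the standardized score; power 0 is the intercept column.
    rows = [[((x - mean) / sd) ** v for v in range(k)] for x in scores]

    def eta(beta):
        return [sum(b * r for b, r in zip(beta, row)) for row in rows]

    def loglik(beta):
        # Poisson loglikelihood, dropping the constant term.
        return sum(n * e - math.exp(e) for n, e in zip(counts, eta(beta)))

    beta = [math.log(N / len(scores))] + [0.0] * V  # start from a flat distribution
    for _ in range(iters):
        s = [math.exp(e) for e in eta(beta)]
        grad = [sum((n - si) * row[a] for n, si, row in zip(counts, s, rows))
                for a in range(k)]
        hess = [[sum(si * row[a] * row[c] for si, row in zip(s, rows))
                 for c in range(k)] for a in range(k)]
        step = solve(hess, grad)
        # Newton step with step halving so the loglikelihood does not decrease.
        t, old = 1.0, loglik(beta)
        trial = [b + d for b, d in zip(beta, step)]
        while loglik(trial) < old and t > 1e-8:
            t /= 2.0
            trial = [b + t * d for b, d in zip(beta, step)]
        beta = trial
    return [math.exp(e) for e in eta(beta)]

# Invented observed frequencies at scores 0..20 (hypothetical data).
scores = list(range(21))
counts = [1, 3, 3, 8, 11, 19, 24, 35, 38, 46, 44, 45, 36, 33, 22, 18, 10, 8, 3, 2, 1]
smoothed = smooth_loglinear(scores, counts, V=4)

# The total and the first four raw moments of the smoothed and observed
# distributions should agree up to numerical error.
for v in range(5):
    observed = sum(n * x ** v for x, n in zip(scores, counts))
    fitted = sum(s * x ** v for x, s in zip(scores, smoothed))
    print(v, observed, fitted)
```

Standardizing the scores before taking powers is a numerical convenience chosen for this sketch; it keeps the fourth-degree terms on a manageable scale without changing the fitted frequencies.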

The $E(Y_m)$'s are computed based on the separate modeling of four $X_m$ frequency 
distributions, $s_{RmY=1}$, $s_{RmY=0}$, $s_{FmY=1}$, and $s_{FmY=0}$, with loglinear models such as (C1), 

The $E(Y_m)$'s from (C2) are used in (1) and (2). 

Estimates of the variances of the $E(Y)$'s for (3) can be computed from (C2) based on the 
delta method. For $E(Y_{Rm})$, 
where $\mathrm{Var}(s_{RmY=1})$ is obtained from the $s_{RmY=1}$ model's results and is the mth diagonal entry of 
the estimated covariance matrix of the fitted frequencies $s_{RY=1}$. In that matrix, $\mathrm{DIAG}_{s_{RY=1}}$ is the diagonalized matrix of $s_{RY=1}$, $D_{RY=1}$ is an 
(M + 1)-by-V design matrix containing all of the $s_{RmY=1}$ model's $X_m^v$ terms, and $\mathrm{Var}(\beta_{RY=1})$ is the 
negative inverse of the second derivatives of the $s_{RmY=1}$ model's loglikelihood function with 
respect to the model's parameters, $\beta_{RY=1}$, when the maximum likelihood algorithm converges 
(Holland & Thayer, 2000). The estimation of $\mathrm{Var}(s_{RmY=0})$ is similar to that of $\mathrm{Var}(s_{RmY=1})$. The 
estimation of $\mathrm{Var}(E(Y_{Fm}))$ is similar to that of $\mathrm{Var}(E(Y_{Rm}))$. 
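
The delta-method step can be sketched as a short worked derivation. Two assumptions are made here for illustration and are not taken from the study's own expressions: that the expected score takes the usual proportion form built from the two smoothed frequencies, and that the separately modeled frequencies $s_{RmY=1}$ and $s_{RmY=0}$ are treated as independent.

```latex
% Assumed proportion form of the expected score (illustrative only)
E(Y_{Rm}) = \frac{s_{RmY=1}}{s_{RmY=1} + s_{RmY=0}}

% First-order partial derivatives
\frac{\partial E(Y_{Rm})}{\partial s_{RmY=1}} = \frac{s_{RmY=0}}{(s_{RmY=1} + s_{RmY=0})^2},
\qquad
\frac{\partial E(Y_{Rm})}{\partial s_{RmY=0}} = \frac{-\,s_{RmY=1}}{(s_{RmY=1} + s_{RmY=0})^2}

% Delta-method variance under the independence assumption
\mathrm{Var}\big(E(Y_{Rm})\big) \approx
\frac{s_{RmY=0}^{2}\,\mathrm{Var}(s_{RmY=1}) + s_{RmY=1}^{2}\,\mathrm{Var}(s_{RmY=0})}
     {(s_{RmY=1} + s_{RmY=0})^{4}}
```

The same steps with the focal group's smoothed frequencies would give the corresponding approximation for $\mathrm{Var}(E(Y_{Fm}))$.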

Overall models of the $X_m$ frequency distributions of the focal and reference data for the 
two possible scores on Y can be fit to create statistical significance tests of Y's DIF. Let $G_m$ be a 
dichotomously coded indicator of focal or reference group membership and let $Y_m$ indicate the 
obtained score on Y, where both levels of $G_m$ and $Y_m$ appear for all levels of $X_m$. Two models 
considered in this study are a nonuniform DIF model that combines all of the independently 
modeled $s_{RmY=1}$, $s_{RmY=0}$, $s_{FmY=1}$, and $s_{FmY=0}$ distributions of form (C1) into an overall model, 
and a non-DIF model, 
$$\ln(s_{GmY}) = \alpha + \sum_{v=1}^{V} \beta_{X,v} X_m^v + \beta_Y Y_m + \beta_G G_m + \sum_{v=1}^{V} \beta_{XY,v} X_m^v Y_m + \sum_{v=1}^{V} \beta_{XG,v} X_m^v G_m \tag{C5}$$

Model (C5) does not contain (C4)'s terms that allow for uniform DIF that is constant across the 
$X_m$ categories, $\beta_{YG} Y_m G_m$, and nonuniform DIF that allows DIF to vary across the $X_m$ 
categories, $\sum_{v=1}^{V} \beta_{XYG,v} X_m^v Y_m G_m$. There are many variations on these two models for assessing 
DIF, and some of the implications of using other models are described in the Discussion section. 

A significance test of DIF can be computed by comparing the loglikelihoods of models 
(C4) and (C5), 

$$\chi^2 = -2\left(\ln L(M_{C5}) - \ln L(M_{C4})\right), \tag{C6}$$

where $\ln L(M_{C5})$ is the maximized loglikelihood for model (C5), 

The statistic in (C6) is chi-square distributed with degrees of freedom equal to the difference in 
the degrees of freedom for models (C5) and (C4), or $V + 1$. 
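
As a hedged illustration of how the statistic in (C6) is referred to its chi-square distribution, the Python sketch below evaluates the upper-tail probability for integer degrees of freedom using the standard closed-form series. The function name `chi2_sf` and the two loglikelihood values are hypothetical; only the formula for the statistic and the degrees of freedom (V + 1, so 5 when V = 4) come from the text.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite series in odd half-powers of x.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * j + 1)
    return total

V = 4
ll_c4 = -1520.3   # hypothetical maximized loglikelihood of the DIF model (C4)
ll_c5 = -1528.9   # hypothetical maximized loglikelihood of the non-DIF model (C5)
chi2 = -2.0 * (ll_c5 - ll_c4)       # statistic (C6)
p_value = chi2_sf(chi2, V + 1)      # V + 1 = 5 degrees of freedom
print(chi2, p_value)
```

The same function covers the 2-degree-of-freedom test of Appendix B, where the series reduces to exp(-x/2).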
Appendix D 

Differential Item Functioning (DIF) Estimates Using Kernel Smoothing 

Kernel smoothing computes kernel-smoothed $E(Y_m)$'s as moving and weighted averages 
of the raw $E(Y_m)$'s estimated in (A1). These kernel-smoothed expected scores, $KSE(Y_{Rm})$ and 
$KSE(Y_{Fm})$, can be used in (1) and (2), 

$$KSE(Y_{Rm}) = w_{Rm} E(Y_R) \quad \text{and} \quad KSE(Y_{Fm}) = w_{Fm} E(Y_F), \tag{D1}$$

where $E(Y_R)$ and $E(Y_F)$ are M-by-1 vectors containing each of the raw $E(Y_m)$'s, and $w_{Rm}$ 
and $w_{Fm}$ are 1-by-M matrices each containing $l = 1$ to $M$ kernel weights, $w_{Rml}$ and $w_{Fml}$. The 
kernel weights considered here are Gaussian weights, 
which are one type of kernel smoothing weights; other types include quadratic, uniform, and 
logistic weights, which are understood to perform similarly (Douglas et al., 1996; Ramsay, 1991). In (D2), 
$n_{Rl}$ is the reference group's sample size at $X_l$, $\sigma_{X_R}$ is the reference group's standard deviation 
on X, and $h$ is a kernel bandwidth parameter that determines the extent of smoothing done to the 
$E(Y_m)$'s in computing the $KSE(Y_m)$'s. Suggestions of default $h$ values are typically based on 
total sample size (e.g., Douglas et al., 1996; Ramsay, 1991, p. 618). In this study $h$ is set at 
$1.1N^{-.2}$, where $N$ is the reference group's total sample size. The kernel weights for the focal 
group, $w_{Fml}$, are computed similarly to $w_{Rml}$, by using the focal group's conditional and overall 
sample sizes and the focal group's standard deviation of X. The kernel weights given in (D2) reflect 
how kernel smoothing is done at ETS to assess item response functions without the use of 
parametric models and also to assess conditional DIF. 
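
The smoothing in (D1) can be sketched in a few lines of Python. The sketch assumes that the Gaussian weights take the usual normalized form, with each score level weighted by its sample size and by a Gaussian kernel of the score distance scaled by $h$ times the group's standard deviation; the exact expression in (D2) may differ in detail. The function names and the data are hypothetical.

```python
import math

def score_sd(scores, n):
    """Standard deviation of the score distribution with frequencies n."""
    N = sum(n)
    mean = sum(x * c for x, c in zip(scores, n)) / N
    return math.sqrt(sum(c * (x - mean) ** 2 for x, c in zip(scores, n)) / N)

def bandwidth(n):
    """Bandwidth h = 1.1 * N^(-0.2), with N the group's total sample size."""
    return 1.1 * sum(n) ** -0.2

def kernel_weights(m, scores, n, sd, h):
    """Assumed form of the weights: normalized, sample-size-weighted Gaussian kernel."""
    raw = [n_l * math.exp(-0.5 * ((scores[m] - x_l) / (h * sd)) ** 2)
           for x_l, n_l in zip(scores, n)]
    total = sum(raw)
    return [r / total for r in raw]

def kernel_smooth(scores, n, e_raw):
    """Kernel-smoothed expected scores: weighted averages of the raw values, as in (D1)."""
    sd, h = score_sd(scores, n), bandwidth(n)
    return [sum(w * e for w, e in zip(kernel_weights(m, scores, n, sd, h), e_raw))
            for m in range(len(scores))]

# Hypothetical reference-group data at scores 0..10.
scores = list(range(11))
n_ref = [5, 12, 25, 40, 55, 60, 52, 38, 22, 10, 4]
e_raw = [0.10, 0.22, 0.18, 0.35, 0.41, 0.55, 0.52, 0.70, 0.66, 0.85, 0.80]
kse_ref = kernel_smooth(scores, n_ref, e_raw)
print([round(v, 3) for v in kse_ref])
```

Because each set of weights sums to 1, every smoothed value is a weighted average of the raw values; score levels with few examinees borrow more heavily from well-populated neighboring levels.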
The variances of (D1) that can be used in (3) can be computed using the raw conditional 
variances estimated in (A2) and the kernel weighting functions in (D2), 

$$\mathrm{Var}(KSE(Y_{Rm})) = w_{Rm} \mathrm{Var}(E(Y_R)) w_{Rm}^t \quad \text{and} \quad \mathrm{Var}(KSE(Y_{Fm})) = w_{Fm} \mathrm{Var}(E(Y_F)) w_{Fm}^t \tag{D3}$$

In (D3), $\mathrm{Var}(E(Y_F))$ and $\mathrm{Var}(E(Y_R))$ are M-by-M matrices containing the M raw 
conditional variances in the diagonal cells and zeros in the other cells. 
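
Because the variance matrices in (D3) are diagonal, the quadratic form reduces to a sum of squared weights times raw conditional variances. The following Python sketch checks that reduction against the full matrix product; the weights and variances are hypothetical values chosen for illustration, and the function names are not from the study.

```python
def kse_variance(weights, var_raw):
    """w Var w^t for a diagonal Var: sum of squared weights times raw variances."""
    return sum(w * w * v for w, v in zip(weights, var_raw))

def kse_variance_full(weights, var_raw):
    """Same quantity computed from the explicit M-by-M diagonal matrix."""
    M = len(weights)
    var_matrix = [[var_raw[i] if i == j else 0.0 for j in range(M)] for i in range(M)]
    row = [sum(weights[i] * var_matrix[i][j] for i in range(M)) for j in range(M)]
    return sum(row[j] * weights[j] for j in range(M))

# Hypothetical kernel weights for one score level and raw conditional variances.
weights = [0.25, 0.50, 0.25]
var_raw = [0.04, 0.01, 0.09]
print(kse_variance(weights, var_raw), kse_variance_full(weights, var_raw))
```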

The estimate of the standard error for an overall kernel-smoothed standardized E-Dif 
statistic can be obtained by expressing the kernel-smoothed standardized E-Dif statistic based on 
using the kernel-smoothed terms in (D1) in (2), 



In (D4) and (D5), $N_F$ is the total sample size of the focal group, $n_F^t$ is the transposed M-by-1 
vector of the focal group's observed frequencies at all M score levels of $X_m$, and $w_R$ and $w_F$ are 
M-by-M matrices containing all M 1-by-M $w_{Rm}$ and $w_{Fm}$ matrices stacked from m = 1 to M. This 
study evaluates the accuracy of a z-test of (D4) based on dividing it by the square root of (D5). 